Recall the mtcars dataset we worked with earlier, which comprises fuel consumption and ten other aspects of design and performance for 32 cars from 1974. The dataset has 11 dimensions, which is more than can be visualized at the same time.
head(mtcars)
Use prcomp() to compute a PCA for mtcars. Remember to set the scale parameter, as the variables are in different units and have different ranges.
mtcars.pca <- prcomp(mtcars, scale = TRUE)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Percent of variance explained:
fviz_eig(mtcars.pca)
eig <- mtcars.pca$sdev^2
(var.exp <- 100*eig/sum(eig))
## [1] 60.0763659 24.0951627 5.7017934 2.4508858 2.0313737 1.9236011
## [7] 1.2296544 1.1172858 0.7004241 0.4730495 0.2004037
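The same numbers can be read off cumulatively, which helps decide how many components to keep:

```r
# Cumulative percent of variance explained; the first two PCs
# already account for roughly 84% of the total variance.
cumsum(var.exp)
```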
fviz_pca_biplot(mtcars.pca) + coord_fixed()
mtcars.pca$rotation
## PC1 PC2 PC3 PC4 PC5 PC6
## mpg -0.3625305 0.01612440 -0.22574419 -0.022540255 0.10284468 -0.10879743
## cyl 0.3739160 0.04374371 -0.17531118 -0.002591838 0.05848381 0.16855369
## disp 0.3681852 -0.04932413 -0.06148414 0.256607885 0.39399530 -0.33616451
## hp 0.3300569 0.24878402 0.14001476 -0.067676157 0.54004744 0.07143563
## drat -0.2941514 0.27469408 0.16118879 0.854828743 0.07732727 0.24449705
## wt 0.3461033 -0.14303825 0.34181851 0.245899314 -0.07502912 -0.46493964
## qsec -0.2004563 -0.46337482 0.40316904 0.068076532 -0.16466591 -0.33048032
## vs -0.3065113 -0.23164699 0.42881517 -0.214848616 0.59953955 0.19401702
## am -0.2349429 0.42941765 -0.20576657 -0.030462908 0.08978128 -0.57081745
## gear -0.2069162 0.46234863 0.28977993 -0.264690521 0.04832960 -0.24356284
## carb 0.2140177 0.41357106 0.52854459 -0.126789179 -0.36131875 0.18352168
## PC7 PC8 PC9 PC10 PC11
## mpg 0.367723810 -0.754091423 0.235701617 0.13928524 -0.124895628
## cyl 0.057277736 -0.230824925 0.054035270 -0.84641949 -0.140695441
## disp 0.214303077 0.001142134 0.198427848 0.04937979 0.660606481
## hp -0.001495989 -0.222358441 -0.575830072 0.24782351 -0.256492062
## drat 0.021119857 0.032193501 -0.046901228 -0.10149369 -0.039530246
## wt -0.020668302 -0.008571929 0.359498251 0.09439426 -0.567448697
## qsec 0.050010522 -0.231840021 -0.528377185 -0.27067295 0.181361780
## vs -0.265780836 0.025935128 0.358582624 -0.15903909 0.008414634
## am -0.587305101 -0.059746952 -0.047403982 -0.17778541 0.029823537
## gear 0.605097617 0.336150240 -0.001735039 -0.21382515 -0.053507085
## carb -0.174603192 -0.395629107 0.170640677 0.07225950 0.319594676
fviz_contrib(mtcars.pca, choice = "var", axes = 1)
fviz_contrib(mtcars.pca, choice = "var", axes = 2)
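Since the loading vectors returned by prcomp() have unit norm, the percentages reported by fviz_contrib() can be reproduced by hand as squared loadings; a minimal sketch for the first axis:

```r
# Contribution (%) of each variable to PC1 = 100 * squared loading
sort(100 * mtcars.pca$rotation[, "PC1"]^2, decreasing = TRUE)
```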
We will generate synthetic clustered data to use for k-means clustering.
set.seed(489576)
N <- 1000
C1 <- data.frame(cluster = "C1", x = rnorm(n = N, mean = 1), y = rnorm(n = N, mean = 1))
C2 <- data.frame(cluster = "C2", x = rnorm(n = N, mean = -2), y = rnorm(n = N, mean = -5))
C3 <- data.frame(cluster = "C3", x = rnorm(n = N, mean = 5), y = rnorm(n = N, mean = 1))
DF <- rbind(C1, C2, C3)
ggplot(DF, aes(x, y, color = cluster)) +
geom_point()
kmeans.res <- kmeans(x = DF[, -1], centers = 3)
kmeans.res$centers
## x y
## 1 -2.0431190 -4.9932291
## 2 5.0098346 0.9780214
## 3 0.9614141 1.0058486
table(kmeans = kmeans.res$cluster, true = DF$cluster)
## true
## kmeans C1 C2 C3
## 1 0 999 0
## 2 32 0 977
## 3 968 1 23
library(ggplot2)
DF$kmeans <- factor(kmeans.res$cluster)
ggplot(DF, aes(x, y)) +
geom_point(alpha = 0.5, aes( color = kmeans)) +
geom_point(data = data.frame(x = kmeans.res$centers[, 1],
y = kmeans.res$centers[, 2]), size = 3, aes(x, y), color = "Black")
kmeans.res2 <- kmeans(x = DF[, c("x", "y")], centers = 4)
kmeans.res2$centers
DF$kmeans2 <- factor(kmeans.res2$cluster)
ggplot(DF, aes(x, y, color = kmeans2)) +
geom_point(alpha = 0.5)
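With more centers than true clusters, k-means is forced to split one of them. A common heuristic for picking the number of clusters is an elbow plot of the total within-cluster sum of squares; a minimal sketch on the x and y columns:

```r
# Elbow plot: total within-cluster SS for k = 1..8.
# nstart = 10 re-runs kmeans from several random starts.
wss <- sapply(1:8, function(k) {
  kmeans(DF[, c("x", "y")], centers = k, nstart = 10)$tot.withinss
})
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```

The curve typically drops sharply up to the true number of clusters and flattens afterwards.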
In this exercise you will use a dataset published in a study by Khan et al. (2001) to perform a hierarchical clustering of the patients in the study based on their overall gene expression data.
This data set consists of expression levels for 2,308 genes. The training and test sets consist of 63 and 20 observations (tissue samples) respectively.
Here, we will use the training set, as for now we are only interested in learning how hclust() works. First, load the ISLR package, where the data is available. The gene expression data is stored in an object Khan$xtrain; you can learn more about the data set by typing ?Khan after loading the ISLR package.
library(ISLR)
gene.expression <- Khan$xtrain
dim(gene.expression)
## [1] 63 2308
D <- dist(gene.expression)
khan.hclust <- hclust(D, method = "average")
Using the tumor class labels in Khan$ytrain, check whether the clustering groups observations from the same tumor class nearby.
plot(khan.hclust, labels = Khan$ytrain)
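To compare the tree to the known classes more formally, cutree() extracts flat cluster assignments from the dendrogram; a sketch assuming we cut into 4 groups (the number of tumor types in the Khan data):

```r
# Cut the dendrogram into 4 groups and cross-tabulate
# against the true tumor class labels.
khan.groups <- cutree(khan.hclust, k = 4)
table(hclust = khan.groups, true = Khan$ytrain)
```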
The file contains data on 28x28 pixel images of digits (0-9). The data is composed of a label column denoting the digit on the image, and pixel0 through pixel783, which contain the pixel intensities (on a scale of 0-255) and together form the vectorized version of the 28x28 pixel digit image. Download the data from the course repository:
# load the already subsetted MNIST data.
mnist.url <- "https://github.com/cme195/cme195.github.io/raw/master/assets/data/mnist_small.csv"
train <- read.csv(mnist.url, row.names = 1)
dim(train)
## [1] 1000 785
train[1:10, 1:10]
# compare with pca
pca <- prcomp(train[,-1])
coord.pca <- data.frame(pca$x[, 1:2])
coord.pca$label <- factor(train$label)
ggplot(coord.pca, aes(x= PC1, y = PC2)) + ggtitle("PCA") +
geom_text(aes(label = label, color = label), alpha = 0.8)
# Use tsne
library(Rtsne)
set.seed(123) # for reproducibility
tsne <- Rtsne(train[,-1], dims = 2, perplexity=30,
verbose=FALSE, max_iter = 500)
coord.tsne <- data.frame(tsne$Y)
coord.tsne$label <- factor(train$label)
ggplot(coord.tsne, aes(x= X1, y = X2)) + ggtitle("tSNE") +
geom_text(aes(label = label, color = label), alpha = 0.8)
tSNE seems to be much better than PCA at separating the digits from each other.
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Rtsne_0.15 ISLR_1.2 factoextra_1.0.7 ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.16 purrr_0.3.4 haven_2.3.1
## [5] carData_3.0-4 colorspace_1.4-1 vctrs_0.3.2 generics_0.0.2
## [9] htmltools_0.5.0 yaml_2.2.1 rlang_0.4.7 ggpubr_0.4.0
## [13] pillar_1.4.6 foreign_0.8-75 glue_1.4.1 withr_2.2.0
## [17] readxl_1.3.1 lifecycle_0.2.0 stringr_1.4.0 munsell_0.5.0
## [21] ggsignif_0.6.0 gtable_0.3.0 cellranger_1.1.0 zip_2.0.4
## [25] evaluate_0.14 labeling_0.3 knitr_1.29 rio_0.5.16
## [29] forcats_0.5.0 curl_4.3 broom_0.7.0 Rcpp_1.0.5
## [33] scales_1.1.1 backports_1.1.8 jsonlite_1.7.0 abind_1.4-5
## [37] farver_2.0.3 hms_0.5.3 digest_0.6.25 stringi_1.4.6
## [41] openxlsx_4.1.5 rstatix_0.6.0 dplyr_1.0.1 ggrepel_0.8.2
## [45] grid_3.6.2 tools_3.6.2 magrittr_1.5 tibble_3.0.3
## [49] crayon_1.3.4 tidyr_1.1.1 car_3.0-8 pkgconfig_2.0.3
## [53] ellipsis_0.3.1 data.table_1.13.0 rmarkdown_2.3 R6_2.4.1
## [57] compiler_3.6.2
Read the data from "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv", which contains information on sales of a product and the amount spent on advertising through different media channels.
Generate a scatterplot of sales against the amount of TV advertising and add a linear fit line.
Now make a 3D scatterplot with axes corresponding to 'sales', 'TV' and 'radio'.
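A possible sketch for the two plots above (the data-frame name ads is ours; the CSV's first column is a row index):

```r
ads <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv",
                row.names = 1)

# Scatterplot of sales vs. TV advertising with a linear fit line
ggplot(ads, aes(TV, sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# 3D scatterplot; the scatterplot3d package is one option (assumed installed)
library(scatterplot3d)
scatterplot3d(ads$TV, ads$radio, ads$sales,
              xlab = "TV", ylab = "radio", zlab = "sales")
```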
The dataset has 200 rows. Divide it into a train set with 150 observations and a test set with 50 observations, i.e. use sample(1:200, size = 150) to randomly choose row indices of the advertising dataset to include in the train set. The remaining indices should be used for the test set. Remember to set the seed for the random number generator!
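One way the split could look (the seed value is arbitrary, and ads is a hypothetical name for the advertising data frame):

```r
set.seed(2020)  # any fixed seed works; it just makes the split reproducible
train.idx <- sample(1:200, size = 150)
ads.train <- ads[train.idx, ]
ads.test  <- ads[-train.idx, ]
```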
Fit a linear model to the training set, where the sales values are predicted by the amount of TV advertising. Print the summary of the fitted model. Then, predict the sales values for the test set and evaluate the test accuracy in terms of the root mean squared error (RMSE), which measures the average level of error between the prediction and the true response. \[RMSE = \sqrt{\frac{1}{n} \sum\limits_{i = 1}^n(\hat y_i - y_i)^2}\]
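A sketch of the fit and the RMSE computation, assuming train/test data frames named ads.train and ads.test (hypothetical names for your own split):

```r
fit1 <- lm(sales ~ TV, data = ads.train)
summary(fit1)

# RMSE on the held-out test set, following the formula above
pred1 <- predict(fit1, newdata = ads.test)
rmse1 <- sqrt(mean((pred1 - ads.test$sales)^2))
rmse1
```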
Fit a multiple linear regression model including all the variables 'TV', 'radio' and 'newspaper' to model 'sales' in the training set. Then, compute the predicted sales for the test set with the new model and evaluate the RMSE.
Did the error decrease from the one corresponding to the previous model?
Look at the summary output for the multiple regression model and note which of the coefficients in the model are significant. Are all of them significant? If not, refit the model including only the features found to be significant. Which of the models should you choose?
Data was collected on doctor visits from a sample of 5,190 people in the 1977/1978 Australian Health Survey. Cameron (1986) sought to explain the variation in doctor visits using one or more explanatory variables. The data can be found in an R data set from library(AER) accessible with the command data("DoctorVisits"). Variable descriptions can be found under help("DoctorVisits")
Explore the use of a zero-inflated model for this data. Begin with a histogram of the number of visits, some EDA, and fitting several models. Summarize your results.
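A starting point, assuming the pscl package is installed; the chosen covariates and the negative-binomial count distribution are illustrative, not a recommended final model:

```r
library(AER)
library(pscl)
data("DoctorVisits")

# EDA: the distribution of visit counts is heavily zero-inflated
hist(as.numeric(DoctorVisits$visits), breaks = 20)

# Zero-inflated negative binomial; `| 1` specifies an
# intercept-only zero-inflation component.
zi.fit <- zeroinfl(visits ~ age + income + illness | 1,
                   data = DoctorVisits, dist = "negbin")
summary(zi.fit)
```

Comparing this fit against plain Poisson and negative-binomial GLMs (e.g. via AIC) is one way to judge whether the zero-inflation component is warranted.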
Recall the movies data frame we used earlier in the bootcamp. It contains information on movies from the last three decades, which was scraped from the IMDB database.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
url <- "https://raw.githubusercontent.com/Juanets/movie-stats/master/movies.csv"
movies <- as_tibble(read.csv(url))
Generate a boxplot of runtimes for action movies and comedies with jittered points overlaid on top. You might consider setting the color, fill and alpha arguments to improve the clarity and transparency of the plot.
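A sketch, assuming the data frame has genre and runtime columns:

```r
movies %>%
  filter(genre %in% c("Action", "Comedy")) %>%
  ggplot(aes(genre, runtime, fill = genre)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +  # hide outliers; jitter shows them
  geom_jitter(width = 0.2, alpha = 0.3)
```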
Test the hypothesis that action movies have a higher mean runtime (length) than comedies. Is the difference statistically greater than zero at significance level \(\alpha = 0.05\)?
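A one-sided two-sample t-test is one way to sketch this (column names assumed):

```r
action <- movies$runtime[movies$genre == "Action"]
comedy <- movies$runtime[movies$genre == "Comedy"]

# H0: equal means; H1: action runtimes are greater on average
t.test(action, comedy, alternative = "greater")
```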
Test the hypothesis that the scores are the same across movie types.
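Since more than two groups are compared, a one-way ANOVA is a natural sketch here (a score column is assumed):

```r
summary(aov(score ~ genre, data = movies))

# If normality is doubtful, a rank-based alternative:
kruskal.test(score ~ genre, data = movies)
```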